INTRODUCTION :

The purpose of this individual project is to clean two data sets so that they can later be compared and combined to extract useful information. The project requires the use of the version control system ‘git’ and a report written in R Markdown.

Throughout the project we work with two data sets, each stored as a CSV file.

These two data sets contain the number of bikes rented at each hour in two cities: Seoul, South Korea and Washington, D.C., USA. Initially both data sets are “messy” (not clean). We will apply data wrangling to bring each one into a “tidy” (clean) state. Once the files are in the correct format, we will create plots to compare bike rentals in Washington, D.C. with those in Seoul. We will also use the plots to examine whether rental numbers depend on the weather conditions (humidity, temperature, wind speed). After visualising these summaries, we will apply statistical analysis to understand the relationships in the data better and to predict future outcomes.
These techniques are applied to the two CSV files below:

DATA WRANGLING :

Seoul, South Korea Data File beginning state

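library(tidyverse) ## Provides %>%, filter(), select(), rename(), mutate() and ggplot2
library(lubridate) ## Provides as_date() and make_datetime()
seoul_Bikes <- read.csv("BikeSeoul.csv")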
head(seoul_Bikes)
##         Date Rented.Bike.Count Hour Temperature.C. Humidity... Wind.speed..m.s.
## 1 01/12/2017               254    0           -5.2          37              2.2
## 2 01/12/2017               204    1           -5.5          38              0.8
## 3 01/12/2017               173    2           -6.0          39              1.0
## 4 01/12/2017               107    3           -6.2          40              0.9
## 5 01/12/2017                78    4           -6.0          36              2.3
## 6 01/12/2017               100    5           -6.4          37              1.5
##   Visibility..10m. Dew.point.temperature.C. Solar.Radiation..MJ.m2.
## 1             2000                    -17.6                       0
## 2             2000                    -17.6                       0
## 3             2000                    -17.7                       0
## 4             2000                    -17.6                       0
## 5             2000                    -18.6                       0
## 6             2000                    -18.7                       0
##   Rainfall.mm. Snowfall..cm. Seasons    Holiday Functioning.Day
## 1            0             0  Winter No Holiday             Yes
## 2            0             0  Winter No Holiday             Yes
## 3            0             0  Winter No Holiday             Yes
## 4            0             0  Winter No Holiday             Yes
## 5            0             0  Winter No Holiday             Yes
## 6            0             0  Winter No Holiday             Yes

Above we can see the initial state of the file (“BikeSeoul.csv”). There are some issues with the structure of the data set:

  • Column names are too long [Rented.Bike.Count, Solar.Radiation..MJ.m2., …]
  • Column names are unclear (units and punctuation have turned into dots, which creates confusion) [Dew.point.temperature.C.]
  • Values are longer than necessary, which makes them slower to type in code [‘No Holiday’]
These problems clutter the data and make it harder to understand. At first glance the data is difficult to read and to summarise.



[TASK 1]

Our first task in the data wrangling is to remove all rows that indicate bikes were not rented, and then to remove the column ‘Functioning Day’, which records this.
Implementation idea
The column ‘Functioning Day’ states whether bikes were rented or not. We filter the rows, keeping only the records whose ‘Functioning Day’ value is “Yes”. Once we have these rows, every remaining value in the column is identical, so the column can be removed.

## Number of rows before removing unneeded records are:
nrow(seoul_Bikes)
## [1] 8760
seoul_Bikes <- seoul_Bikes %>%
  filter(Functioning.Day != "No") %>% ## Keep rows whose Functioning.Day is not "No" (i.e. "Yes")
  select(-Functioning.Day) ## Remove column Functioning.Day
## Number of rows after removing unneeded records are:
nrow(seoul_Bikes)
## [1] 8465



[TASK 2]

Our second task is to give some columns more appropriate names.
From our first look at the table, the following columns need renaming:

  • ‘Rented.Bike.Count’ to ‘Count’
  • ‘Temperature.C.’ to ‘Temperature’
  • ‘Wind.speed..m.s.’ to ‘WindSpeed’
  • ‘Seasons’ to ‘Season’
  • ‘Humidity...’ to ‘Humidity’
Implementation idea
Using the rename() function from the tidyverse, we can rename the columns in a single call.

## Header Names before Renaming
names(seoul_Bikes)
##  [1] "Date"                     "Rented.Bike.Count"       
##  [3] "Hour"                     "Temperature.C."          
##  [5] "Humidity..."              "Wind.speed..m.s."        
##  [7] "Visibility..10m."         "Dew.point.temperature.C."
##  [9] "Solar.Radiation..MJ.m2."  "Rainfall.mm."            
## [11] "Snowfall..cm."            "Seasons"                 
## [13] "Holiday"
seoul_Bikes <- seoul_Bikes %>%
  rename(Count = Rented.Bike.Count, ## Using function we are renaming the columns.
         Temperature = Temperature.C., 
         WindSpeed = Wind.speed..m.s., 
         Season = Seasons,
         Humidity = Humidity...)
## Header Names after Renaming
names(seoul_Bikes)
##  [1] "Date"                     "Count"                   
##  [3] "Hour"                     "Temperature"             
##  [5] "Humidity"                 "WindSpeed"               
##  [7] "Visibility..10m."         "Dew.point.temperature.C."
##  [9] "Solar.Radiation..MJ.m2."  "Rainfall.mm."            
## [11] "Snowfall..cm."            "Season"                  
## [13] "Holiday"



[TASK 3]

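Our third task is to convert the column Date from type ‘character’ to type ‘Date’.
Implementation idea
With the help of the lubridate function ‘as_date()’ we change the type/class of the column. The call below goes through the helper convertDate() (its definition is not shown in this report), which takes the order of the date parts as its argument (“dmy” = day, month, year).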

## Printing class of column Date before mutation.
class(seoul_Bikes$Date)
## [1] "character"
seoul_Bikes <- seoul_Bikes %>%
  convertDate("dmy")
## Printing class of column Date after mutation.
class(seoul_Bikes$Date)
## [1] "Date"
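
The helper convertDate() is called above, but its definition does not appear in this report. The block below is only a guess at what such a helper might look like, assuming it parses the column `Date` with lubridate according to an order string; the function body and the argument name `date_order` are assumptions, not the author's code.

```r
library(dplyr)
library(lubridate)

## Hypothetical helper: parse the character column `Date` into class Date.
## `date_order` is a lubridate order string such as "dmy" or "ymd".
convertDate <- function(data, date_order) {
  data %>%
    mutate(Date = as_date(parse_date_time(Date, orders = date_order)))
}

## Toy data in the same day/month/year layout as the Seoul file.
toy <- data.frame(Date = c("01/12/2017", "02/12/2017"))
toy <- toy %>% convertDate("dmy")
class(toy$Date)
```

Passing the order as an argument is what would let the same helper serve both files: “dmy” for Seoul and “ymd” for Washington.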



[TASK 4]

Our fourth task is to add a new column named ‘FullDate’, which holds the date and hour at which the bikes were rented, in the format ‘Y-M-D H:M:S’.
Implementation idea
The lubridate package provides a function called ‘make_datetime’, which builds a date-time in the required format.

## Printing column names to show that FullDate does not exist.
names(seoul_Bikes)
##  [1] "Date"                     "Count"                   
##  [3] "Hour"                     "Temperature"             
##  [5] "Humidity"                 "WindSpeed"               
##  [7] "Visibility..10m."         "Dew.point.temperature.C."
##  [9] "Solar.Radiation..MJ.m2."  "Rainfall.mm."            
## [11] "Snowfall..cm."            "Season"                  
## [13] "Holiday"
## Calling a self-declared function to undertake this task.
seoul_Bikes <- seoul_Bikes %>%
  createFullDate()
## Printing column names to show creation of FullDate.
names(seoul_Bikes)
##  [1] "Date"                     "Count"                   
##  [3] "Hour"                     "Temperature"             
##  [5] "Humidity"                 "WindSpeed"               
##  [7] "Visibility..10m."         "Dew.point.temperature.C."
##  [9] "Solar.Radiation..MJ.m2."  "Rainfall.mm."            
## [11] "Snowfall..cm."            "Season"                  
## [13] "Holiday"                  "FullDate"
## Printing first 6 rows of FullDate to show the value is in correct format.
head(seoul_Bikes$FullDate)
## [1] "2017-12-01 00:00:00 UTC" "2017-12-01 01:00:00 UTC"
## [3] "2017-12-01 02:00:00 UTC" "2017-12-01 03:00:00 UTC"
## [5] "2017-12-01 04:00:00 UTC" "2017-12-01 05:00:00 UTC"
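
The self-declared function createFullDate() is not listed in this report. A minimal sketch of how it could be written with make_datetime(), assuming the table already has a `Date` column of class Date and an integer `Hour` column (the body is an assumption, not the author's code):

```r
library(dplyr)
library(lubridate)

## Hypothetical helper: combine the Date column and the integer Hour column
## into one date-time column named FullDate (make_datetime() defaults to UTC).
createFullDate <- function(data) {
  data %>%
    mutate(FullDate = make_datetime(year(Date), month(Date), day(Date), Hour))
}

## Toy data: two consecutive hours of the same day.
toy <- data.frame(Date = as.Date(c("2017-12-01", "2017-12-01")),
                  Hour = c(0L, 1L))
toy <- toy %>% createFullDate()
format(toy$FullDate, "%Y-%m-%d %H:%M:%S")
```

Because minutes and seconds are left at their defaults of zero, every FullDate value falls exactly on the hour.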



[TASK 5]

Our fifth task is to change Holiday values:

  • ‘No Holiday’ to ‘No’
  • ‘Holiday’ to ‘Yes’

Additionally, we change the column type/class from ‘character’ to ‘factor’.
We apply this change so that we can separate the bikes rented during a holiday from the ones rented on a non-holiday. The factor type allows us to categorise data.
Implementation idea
Using the tidyverse function mutate() in combination with the ifelse() conditional, we change the record values. Once the values have changed, we convert them to a factor with the level order:

  • Yes > No

## Printing the first 6 rows to show the values before changing
head(seoul_Bikes$Holiday)
## [1] "No Holiday" "No Holiday" "No Holiday" "No Holiday" "No Holiday"
## [6] "No Holiday"
## Printing the class of Holiday to show it's a 'character'
class(seoul_Bikes$Holiday)
## [1] "character"
seoul_Bikes <- seoul_Bikes %>%
  mutate(Holiday = ifelse(Holiday == "No Holiday", "No", "Yes")) %>% ## Changing values to Yes & No
  mutate(Holiday = factor(Holiday, levels = c("Yes", "No"))) ## Changing type to factor with level order Yes, No.
## Printing the first 6 rows to show the values have changed.
head(seoul_Bikes$Holiday)
## [1] No No No No No No
## Levels: Yes No
## Printing the class of Holiday to show it's a 'factor'
class(seoul_Bikes$Holiday)
## [1] "factor"



[TASK 6]

Our sixth task is to convert the column Season to a factor whose levels follow the order:

  • Spring
  • Summer
  • Autumn
  • Winter
The levels are listed from first to last.
Implementation idea
Using the tidyverse ‘mutate()’ function, we convert the column to a factor and specify the order of the levels we want.

## Printing class of the column to show it is not a factor yet.
class(seoul_Bikes$Season)
## [1] "character"
seoul_Bikes <- seoul_Bikes %>%
  mutate(Season = factor(Season, levels = c("Spring", "Summer", "Autumn", "Winter")))
## Printing the first 6 rows to show the factor level order after the change of class
head(seoul_Bikes$Season)
## [1] Winter Winter Winter Winter Winter Winter
## Levels: Spring Summer Autumn Winter



[TASK 7]

Our seventh and final task for cleaning the first data set is to remove the unwanted columns:

  • visibility
  • dew point temperature
  • solar radiation
  • rainfall
  • snowfall
Implementation idea
Using the tidyverse function ‘select’ with the ‘-’ notation, we remove the unwanted columns.

## Printing the column names of the table before removing unwanted data.
names(seoul_Bikes)
##  [1] "Date"                     "Count"                   
##  [3] "Hour"                     "Temperature"             
##  [5] "Humidity"                 "WindSpeed"               
##  [7] "Visibility..10m."         "Dew.point.temperature.C."
##  [9] "Solar.Radiation..MJ.m2."  "Rainfall.mm."            
## [11] "Snowfall..cm."            "Season"                  
## [13] "Holiday"                  "FullDate"
## Calling a self-declared function to undertake the given task.
seoul_Bikes <- seoul_Bikes %>%
  deleteColumn( c("Visibility..10m.",
                  "Solar.Radiation..MJ.m2.",
                  "Rainfall.mm.",
                  "Dew.point.temperature.C.",
                  "Snowfall..cm."))

## Printing column names and first 6 rows to show the end result of the data that has been cleaned.
names(seoul_Bikes)
## [1] "Date"        "Count"       "Hour"        "Temperature" "Humidity"   
## [6] "WindSpeed"   "Season"      "Holiday"     "FullDate"
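
The self-declared function deleteColumn() is also not listed in this report. One way it might be written, assuming it receives a character vector of column names and drops them with select() (a sketch under that assumption, not the author's code):

```r
library(dplyr)

## Hypothetical helper: drop every column whose name appears in `columns`.
## all_of() raises an error if a requested column does not exist,
## which guards against silently misspelled names.
deleteColumn <- function(data, columns) {
  data %>%
    select(-all_of(columns))
}

## Toy table with two of the unwanted Seoul columns.
toy <- data.frame(Count = 1:2, Rainfall.mm. = c(0, 0), Snowfall..cm. = c(0, 0))
toy <- toy %>% deleteColumn(c("Rainfall.mm.", "Snowfall..cm."))
names(toy)
```

Writing the step as a function lets the same call be reused for the Washington file with a different list of names.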


After applying all these steps to the data set, the data is clean and in a ‘tidy’ state. It is now easier to read and to use in plotting and statistical analysis.

The result of Seoul, South Korea Data File after Data Wrangling is :
head(seoul_Bikes)
##         Date Count Hour Temperature Humidity WindSpeed Season Holiday
## 1 2017-12-01   254    0        -5.2       37       2.2 Winter      No
## 2 2017-12-01   204    1        -5.5       38       0.8 Winter      No
## 3 2017-12-01   173    2        -6.0       39       1.0 Winter      No
## 4 2017-12-01   107    3        -6.2       40       0.9 Winter      No
## 5 2017-12-01    78    4        -6.0       36       2.3 Winter      No
## 6 2017-12-01   100    5        -6.4       37       1.5 Winter      No
##              FullDate
## 1 2017-12-01 00:00:00
## 2 2017-12-01 01:00:00
## 3 2017-12-01 02:00:00
## 4 2017-12-01 03:00:00
## 5 2017-12-01 04:00:00
## 6 2017-12-01 05:00:00



Washington DC, USA Data File beginning state

washington_Bikes <- read.csv("BikeWashingtonDC.csv")
head(washington_Bikes)
##   instant     dteday season yr mnth hr holiday weekday workingday weathersit
## 1       1 2011-01-01      1  0    1  0       0       6          0          1
## 2       2 2011-01-01      1  0    1  1       0       6          0          1
## 3       3 2011-01-01      1  0    1  2       0       6          0          1
## 4       4 2011-01-01      1  0    1  3       0       6          0          1
## 5       5 2011-01-01      1  0    1  4       0       6          0          1
## 6       6 2011-01-01      1  0    1  5       0       6          0          2
##   temp  atemp  hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81    0.0000      3         13  16
## 2 0.22 0.2727 0.80    0.0000      8         32  40
## 3 0.22 0.2727 0.80    0.0000      5         27  32
## 4 0.24 0.2879 0.75    0.0000      3         10  13
## 5 0.24 0.2879 0.75    0.0000      0          1   1
## 6 0.24 0.2576 0.75    0.0896      0          1   1

Above we can see the initial state of the file (“BikeWashingtonDC.csv”). There are some issues with the structure of the data set:

  • Column names are abbreviations that readers may interpret differently [cnt, temp, atemp, hum, …]
  • Values are not clear [the column season has values (1, 2, 3, 4); which number represents which season?]
  • Some columns use binary 0/1 values, which not every reader will understand [holiday, workingday]
  • Some data is repeated [dteday already contains (yr, mnth, weekday)], [workingday largely repeats holiday]
These problems clutter the data and make it harder to understand. At first glance the data is difficult to read and to summarise.



[TASK 1]

Our first task in the data wrangling for the file (“BikeWashingtonDC.csv”) is to remove the columns that are unwanted or that repeat information already in the data.

  • instant (unique number for each record)
  • yr (year)
  • mnth (month)
  • weekday (day of the week)
  • workingday (whether the day is a working day, i.e. neither a weekend nor a holiday)
  • weathersit (weather condition)
  • atemp (normalized “feels like” temperature)
  • casual (number of bikes rented by casual users)
  • registered (number of bikes rented by registered users)
Implementation idea
Using the select() function from the tidyverse with the ‘-’ notation, we can name all the columns we want to remove.

## Columns in the file before removing
names(washington_Bikes)
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"
## Calling self-declared function to undertake given task 
washington_Bikes <- washington_Bikes %>%
  deleteColumn( c("instant",
                  "yr",
                  "mnth",
                  "weekday",
                  "workingday",
                  "weathersit",
                  "atemp",
                  "casual",
                  "registered"))
## Columns in file after removing.
names(washington_Bikes)
## [1] "dteday"    "season"    "hr"        "holiday"   "temp"      "hum"      
## [7] "windspeed" "cnt"



[TASK 2]

Our second task is to rename the remaining columns so that they match the Seoul file exactly. Identical column names, including identical letter case, make it possible to combine (join) the two tables.

  • ‘dteday’ to ‘Date’
  • ‘cnt’ to ‘Count’
  • ‘hr’ to ‘Hour’
  • ‘temp’ to ‘Temperature’
  • ‘hum’ to ‘Humidity’
  • ‘windspeed’ to ‘WindSpeed’
  • ‘season’ to ‘Season’
  • ‘holiday’ to ‘Holiday’
Implementation idea
Using the rename() function from the tidyverse, we can rename the existing columns.

## Header Names before Renaming
names(washington_Bikes)
## [1] "dteday"    "season"    "hr"        "holiday"   "temp"      "hum"      
## [7] "windspeed" "cnt"
washington_Bikes <- washington_Bikes %>%
  rename(Count = cnt, ## Using function we are renaming the columns.
         Temperature = temp, 
         WindSpeed = windspeed, 
         Season = season,
         Humidity = hum,
         Date = dteday,
         Hour = hr,
         Holiday = holiday)
## Header Names after Renaming
names(washington_Bikes)
## [1] "Date"        "Season"      "Hour"        "Holiday"     "Temperature"
## [6] "Humidity"    "WindSpeed"   "Count"



[TASK 3]

Our third task is to convert the values of the Humidity field to a percentage. At the moment Humidity is measured on different scales in the two files; after this task both will be in percent.

  • The value is currently stored divided by 100, so we multiply it by 100 to obtain a % value.
Implementation idea
Using the mutate() function from the tidyverse, we can modify the existing column and its values.

## Printing first 6 rows to show the values are currently decimals
head(washington_Bikes$Humidity)
## [1] 0.81 0.80 0.80 0.75 0.75 0.75
washington_Bikes <- washington_Bikes %>%
  mutate(Humidity = Humidity * 100) ## multiply existing value with 100 to make a percentage value.
## Printing first 6 rows to show values out of a hundred (in %), smallest is 0 and largest 100.
head(washington_Bikes$Humidity)
## [1] 81 80 80 75 75 75



[TASK 4]

Our fourth task is to convert the values of the Temperature field to degrees Celsius. At the moment Temperature is measured on different scales in the two files; after this task both will be in degrees Celsius. Temperature in the Washington file is normalized.

  • The value is normalized, which means the formula \((value - T_{min}) / (T_{max} - T_{min})\) was applied to it. [Tmin = -8, Tmax = 39]
  • We have to reverse this formula, which means applying \(value \times (T_{max} - T_{min}) + T_{min}\). [Tmin = -8, Tmax = 39]
Implementation idea
Using the mutate() function from the tidyverse, we can modify the existing column and its values.

## Printing first 6 rows to show the values are currently normalized.
head(washington_Bikes$Temperature)
## [1] 0.24 0.22 0.22 0.24 0.24 0.24
Tmin <- -8
Tmax <- 39
washington_Bikes <- washington_Bikes %>%
  mutate(Temperature = (Temperature)*(Tmax-Tmin)+Tmin) ## apply formula to convert back to degrees Celsius.
## Printing first 6 rows to show values are now in degrees Celsius.
head(washington_Bikes$Temperature)
## [1] 3.28 2.34 2.34 3.28 3.28 3.28
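
As a quick check of the formula, the sketch below writes the normalization and its inverse as two small functions and confirms that one undoes the other. The function names normalise() and denormalise() are illustrative only; the bounds are the ones stated in the text.

```r
## Min-max bounds stated in the text (degrees Celsius).
Tmin <- -8
Tmax <- 39

## Forward transform: how the values in the file were produced, per the text.
normalise <- function(celsius) (celsius - Tmin) / (Tmax - Tmin)

## Inverse transform: the same arithmetic as the mutate() call above.
denormalise <- function(value) value * (Tmax - Tmin) + Tmin

denormalise(c(0.24, 0.22))             ## 0.24 * 47 - 8 = 3.28, 0.22 * 47 - 8 = 2.34
normalise(denormalise(c(0.24, 0.22)))  ## recovers 0.24 and 0.22
```

Since the normalized values lie between 0 and 1, the converted temperatures can never fall outside the range from Tmin to Tmax.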



[TASK 5]

Our fifth task is to convert the values of the WindSpeed field to m/s. At the moment WindSpeed is measured on different scales in the two files; after this task both will be in m/s. The WindSpeed field in the Washington file is the speed in km/h divided by 69, while in the Seoul file it is in m/s.

  • First we convert WindSpeed to km/h by multiplying by 69, because the stored value was divided by 69.
  • Then we apply the formula \(\text{Wind (m/s)} = 0.2777778 \times \text{Wind (km/h)}\), taken from goodcalculator.com.
Implementation idea
Using the mutate() function from the tidyverse, we can modify the existing column and its values.

## Printing first 6 rows to show the values are in (km/h)/69
head(washington_Bikes$WindSpeed)
## [1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0896
makeToKM <- 69
multiplayerConstant <- 0.2777778
washington_Bikes <- washington_Bikes %>%
  mutate(WindSpeed = ((WindSpeed)*(makeToKM))*multiplayerConstant) ## apply formula to convert to m/s
## Printing first 6 rows to show values are now in m/s
head(washington_Bikes$WindSpeed)
## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 1.717333
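
The two conversion steps can also be written as one small function, which makes the arithmetic easy to check on a single value. The function name normalised_to_ms() and its argument names are hypothetical; the constants are the ones used above.

```r
## Hypothetical helper; constants are the ones given in the text.
normalised_to_ms <- function(value, scale_kmh = 69, kmh_to_ms = 0.2777778) {
  kmh <- value * scale_kmh   ## undo the division by 69 -> km/h
  kmh * kmh_to_ms            ## km/h -> m/s (1 km/h = 1000 m / 3600 s)
}

## Sixth value from the head() output above: 0.0896 * 69 = 6.1824 km/h.
normalised_to_ms(0.0896)
```

The factor 0.2777778 is simply 1000/3600 rounded to seven decimal places, so the result agrees with the 1.717333 shown in the output above.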



[TASK 6]

Our sixth task is to change the values of the field Season, then convert its class to a factor and give the factor levels a specific order.

  • ‘1’ to ‘Winter’
  • ‘2’ to ‘Spring’
  • ‘3’ to ‘Summer’
  • ‘4’ to ‘Autumn’

The levels must follow the order:

  • Spring > Summer > Autumn > Winter
Implementation idea
Using the mutate() function from the tidyverse with the ifelse() conditional, we change the field’s values. A further mutate() call then changes the class of the field to a factor.

## Printing the first 6 rows to show the values before changing
head(washington_Bikes$Season)
## [1] 1 1 1 1 1 1
## Printing the class of Season to show it's an 'integer'
class(washington_Bikes$Season)
## [1] "integer"
washington_Bikes <- washington_Bikes %>%
  mutate(Season = ifelse(Season == 1, "Winter", Season)) %>% ## Replacing each code with its season name: Winter | Spring | Summer | Autumn.
  mutate(Season = ifelse(Season == 2, "Spring", Season)) %>%
  mutate(Season = ifelse(Season == 3, "Summer", Season)) %>%
  mutate(Season = ifelse(Season == 4, "Autumn", Season)) %>%
  mutate(Season = factor(Season, levels = c("Spring", "Summer", "Autumn", "Winter"))) ## Changing type to factor with the given level order.
## Printing the first 6 rows to show the values have changed.
head(washington_Bikes$Season)
## [1] Winter Winter Winter Winter Winter Winter
## Levels: Spring Summer Autumn Winter
## Printing the class of Season to show it's a 'factor'
class(washington_Bikes$Season)
## [1] "factor"
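
As an alternative to the chain of ifelse() calls, base R's factor() can recode and order in one step through its `levels` and `labels` arguments. This is a different technique from the one used above, shown on a toy vector with the same integer coding (1 = Winter, 2 = Spring, 3 = Summer, 4 = Autumn):

```r
## Toy vector using the same integer coding as the Washington file.
season_code <- c(1L, 2L, 3L, 4L, 1L)

## levels = the codes in the desired display order;
## labels = the season name attached to each of those codes.
season <- factor(season_code,
                 levels = c(2, 3, 4, 1),
                 labels = c("Spring", "Summer", "Autumn", "Winter"))

levels(season)        ## level order: Spring, Summer, Autumn, Winter
as.character(season)  ## code 1 becomes "Winter", 2 "Spring", and so on
```

The pairing of `levels` and `labels` is positional, so the two vectors must be kept in the same order.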



[TASK 7]

Our seventh task is to convert the Holiday values to Yes or No. Afterwards we convert the type to factor, so the data can be categorized.

  • 0 to ‘No’
  • 1 to ‘Yes’
Implementation idea
Using the mutate() function from the tidyverse with the ifelse() conditional, we change the field’s values. A further mutate() call then changes the class of the field to a factor.

## Printing the first 6 rows to show the values before changing
head(washington_Bikes$Holiday)
## [1] 0 0 0 0 0 0
## Printing the class of Holiday to show it's an 'integer'
class(washington_Bikes$Holiday)
## [1] "integer"
washington_Bikes <- washington_Bikes %>%
  mutate(Holiday = ifelse(Holiday == 0, "No", "Yes")) %>% ## Changing values to Yes & No (0=No, 1=Yes).
  mutate(Holiday = factor(Holiday, levels = c("Yes", "No"))) ## Changing type to factor with level order Yes, No.
## Printing the first 6 rows to show the values have changed.
head(washington_Bikes$Holiday)
## [1] No No No No No No
## Levels: Yes No
## Printing the class of Holiday to show it's a 'factor'
class(washington_Bikes$Holiday)
## [1] "factor"



[TASK 8]

Our eighth task is to convert the class of the field Date from ‘character’ to ‘Date’.

Implementation idea
With the help of the lubridate function ‘as_date()’ we change the type/class of the attribute to Date.

## Printing class of column Date before mutation.
class(washington_Bikes$Date)
## [1] "character"
washington_Bikes <- washington_Bikes %>%
  convertDate("ymd")

## Printing class of column Date after mutation.
class(washington_Bikes$Date)
## [1] "Date"



[TASK 9]

Our ninth task is to add a new column named ‘FullDate’, which holds the date and hour at which the bikes were rented, in the format ‘Y-M-D H:M:S’.
Implementation idea
The lubridate package provides a function called ‘make_datetime’, which builds a date-time in the required format.

## Printing column names to show that FullDate does not exist.
names(washington_Bikes)
## [1] "Date"        "Season"      "Hour"        "Holiday"     "Temperature"
## [6] "Humidity"    "WindSpeed"   "Count"
## Calling a self-declared function to apply this task.
washington_Bikes <- washington_Bikes %>%
  createFullDate()

## Printing column names to show creation of FullDate.
names(washington_Bikes)
## [1] "Date"        "Season"      "Hour"        "Holiday"     "Temperature"
## [6] "Humidity"    "WindSpeed"   "Count"       "FullDate"
## Printing first 6 rows of FullDate to show the value is in correct format.
head(washington_Bikes$FullDate)
## [1] "2011-01-01 00:00:00 UTC" "2011-01-01 01:00:00 UTC"
## [3] "2011-01-01 02:00:00 UTC" "2011-01-01 03:00:00 UTC"
## [5] "2011-01-01 04:00:00 UTC" "2011-01-01 05:00:00 UTC"


After applying all these steps to the data set, the data is clean and in a ‘tidy’ state. It is now easier to read and to use in plotting and statistical analysis. Additionally, the data now has the same columns and units as the Seoul file, which allows us to join/combine the data sets and explore different relationships.

The result of Washington DC, USA Data File after Data Wrangling is :
head(washington_Bikes)
##         Date Season Hour Holiday Temperature Humidity WindSpeed Count
## 1 2011-01-01 Winter    0      No        3.28       81  0.000000    16
## 2 2011-01-01 Winter    1      No        2.34       80  0.000000    40
## 3 2011-01-01 Winter    2      No        2.34       80  0.000000    32
## 4 2011-01-01 Winter    3      No        3.28       75  0.000000    13
## 5 2011-01-01 Winter    4      No        3.28       75  0.000000     1
## 6 2011-01-01 Winter    5      No        3.28       75  1.717333     1
##              FullDate
## 1 2011-01-01 00:00:00
## 2 2011-01-01 01:00:00
## 3 2011-01-01 02:00:00
## 4 2011-01-01 03:00:00
## 5 2011-01-01 04:00:00
## 6 2011-01-01 05:00:00
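
Now that both tables share the same column names, one possible way to combine them is to stack the rows and add a label that records where each row came from. The sketch below assumes this approach; the column `City` and the toy tables are hypothetical, with a few Hour and Count values copied from the head() output above.

```r
library(dplyr)

## Toy stand-ins for the two cleaned tables (first two rows of each).
seoul_toy      <- data.frame(Hour = 0:1, Count = c(254, 204))
washington_toy <- data.frame(Hour = 0:1, Count = c(16, 40))

## `City` is a hypothetical label column so rows stay identifiable after stacking.
all_bikes <- bind_rows(
  seoul_toy %>% mutate(City = "Seoul"),
  washington_toy %>% mutate(City = "Washington")
)

## One summary per city from the combined table.
all_bikes %>%
  group_by(City) %>%
  summarise(MeanCount = mean(Count))
```

bind_rows() matches columns by name, which is why the renaming in Task 2 had to reproduce the Seoul names exactly, including letter case.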



DATA VISUALISATION :

Air temperature variation over the course of a year

We will apply some visualisation and statistical analysis to compare the air temperature of both locations.
Because my personal laptop is not powerful enough, I will use only ggplot2 for the visualisation rather than ggplotly. We will see point plots and boxplots for both data sets.

## aligns the plot/figure in the center
## height/width gives size  in inches
ggplot(seoul_Bikes) +
  geom_point(aes(x = Date, y = Temperature), col="dark grey") +
  stat_smooth(aes(x = Date, y = Temperature)) + ## Adds a smoothed trend line
  xlab("Date") +  ## Naming x and y axes.
  ylab("Air Temperature (degrees celsius)") +
  ggtitle("Air Temperature variation of Seoul, South Korea") + ## Adding title to graph
  theme(plot.title = element_text(hjust = 0.5)) ## To align the title of the graph in the center, code found on stackoverflow :" https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2 "

Using the point plot and stat_smooth() we can view the smoothed trend of how the air temperature varies in Seoul, South Korea over the year. Temperatures vary widely: the warmest months are between May and August, and the remaining months are colder.

## The mean air temperature.
seoul_Bikes %>% 
  summarise(Mean=mean(Temperature))
##       Mean
## 1 12.77106
## The highest recorded temperature.
seoul_Bikes %>% 
  summarise(Maximum=max(Temperature))
##   Maximum
## 1    39.4
## The lowest recorded temperature.
seoul_Bikes %>% 
  summarise(Minimum=min(Temperature))
##   Minimum
## 1   -17.8

Temperatures can reach nearly 40 degrees Celsius at the hottest, but can also drop to nearly -18 degrees Celsius.

ggplot(washington_Bikes) +
  geom_point(aes(x = Date, y = Temperature), col = "orange") +
  stat_smooth(aes(x = Date, y = Temperature)) + 
  xlab("Date") + 
  ylab("Air Temperature (degrees celsius)") +
  ggtitle("Air Temperature variation of Washington, DC, America") +
  theme(plot.title = element_text(hjust = 0.5))

Using the point plot and stat_smooth() we can view the smoothed trend of how the air temperature varies in Washington, DC. The pattern is similar to Seoul's. The curve rises and falls twice because the Washington data covers two years, whereas the Seoul data covers one.

## The mean air temperature.
washington_Bikes %>% 
  summarise(Mean=mean(Temperature))
##      Mean
## 1 15.3584
## The highest recorded temperature.
washington_Bikes %>% 
  summarise(Maximum=max(Temperature))
##   Maximum
## 1      39
## The lowest recorded temperature.
washington_Bikes %>% 
  summarise(Minimum=min(Temperature))
##   Minimum
## 1   -7.06

The average temperature in Washington is higher than in Seoul. The highest temperatures in the two locations are close, but Seoul's winter is a lot colder than Washington's: the lowest temperatures differ by roughly 10 degrees Celsius.

Do seasons affect the average number of rented bikes?

ggplot(seoul_Bikes) +
  geom_boxplot(aes(x = Season, y = Count), col = "dark grey") +
  xlab("Season") + 
  ylab("Number of bikes rented") +
  ggtitle("Bikes Rented per Season in Seoul, South Korea") +
  theme(plot.title = element_text(hjust = 0.5))

From the boxplot above we can observe a significant drop in bikes rented in the winter. The highest renting season is summer, although autumn and spring are not far behind. The plot suggests that in Seoul the number of bikes rented depends on the season.

ggplot(washington_Bikes) +
  geom_boxplot(aes(x = Season, y = Count), col = "orange") +
  xlab("Season") + 
  ylab("Number of bikes rented") +
  ggtitle("Bikes Rented per Season in Washington, DC, America") +
  theme(plot.title = element_text(hjust = 0.5))

From the boxplot above we can observe a significant drop in bikes rented in the winter in Washington as well. The highest renting season is again summer, although autumn and spring are not far behind. The plot suggests that in Washington the number of bikes rented depends on the season.
Both locations show a dramatic fall in rented bikes in the winter, so in both cities the number of bikes rented appears to depend on the season.

Do holidays increase or decrease the demand for rented bikes?

ggplot(seoul_Bikes) +
  geom_boxplot(aes(x = Holiday, y = Count), col = "dark grey") +
  xlab("Holiday") + 
  ylab("Number of bikes rented") +
  ggtitle("Bikes Rented per Holiday in Seoul, South Korea") +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(washington_Bikes) +
  geom_boxplot(aes(x = Holiday, y = Count), col = "orange") +
  xlab("Holiday") + 
  ylab("Number of bikes rented") +
  ggtitle("Bikes Rented per Holiday in Washington DC, America") +
  theme(plot.title = element_text(hjust = 0.5))

By observing the two boxplots above, we can see that more bikes are rented when it is not a holiday. The difference is larger in the Seoul data set, where the non-holiday counts are clearly higher; in the Washington data set the difference is smaller.
A possible explanation is that residents need to work when it is not a holiday and use the bikes as their means of transportation.

How does the time of day affect the demand for rented bikes?

grouped_seoul <- seoul_Bikes%>%
   group_by(Hour) %>% ##Grouping data in groups per hour.
   summarise(Average.Rented=mean(Count))  ## Finding average number of bikes rented per hour.

ggplot(grouped_seoul,aes(x = Hour, y = Average.Rented, fill = Hour)) +
  geom_bar(stat = "identity") +
  labs(fill = "Hour of Day") + ## Legend title; the plot maps Hour to fill, not colour
  xlab("Hour of Day") + 
  ylab("Number of bikes rented") +
  ggtitle("Average Bikes Rented per Hour in Seoul, South Korea") +
  theme(plot.title = element_text(hjust = 0.5))

Using the bar graph we can see how busy each hour is on average in Seoul. The graph plots the average number of bikes rented in each hour of the day. Demand is lowest between 4 and 5 o’clock in the morning. There is a significant rise at 8 o’clock in the morning, after which demand falls again. Demand then rises gradually from 10 o’clock in the morning and reaches its peak at 18 o’clock, after which it starts dropping again.
Using the mean demand for rented bikes per hour, our conclusion is that the busiest hours of the day are 8 in the morning and 17-19 in the evening. The busiest single hour is 18:00.

grouped_wash <- washington_Bikes%>%
   group_by(Hour) %>% ##Grouping data in groups per hour.
   summarise(Average.Rented=mean(Count))  ## Finding average number of bikes rented per hour.

ggplot(grouped_wash,aes(x = Hour, y = Average.Rented, fill = Hour)) +
  geom_bar(stat = "identity") +
  labs(fill = "Hour of Day") +  ## The bars are mapped to fill, so the legend title is set on fill.
  xlab("Hour of Day") + 
  ylab("Number of bikes rented") +
  ggtitle("Average Bikes Rented per Hour in Washington DC, America") +
  theme(plot.title = element_text(hjust = 0.5))

The bar graph shows how busy each hour is on average in Washington. Each bar is the mean number of bikes rented in that hour, taken over all days in the dataset. Demand is at its lowest between 3 and 5 o’clock in the morning. There is a sharp rise at 8 o’clock in the morning, after which demand falls again. From 10 o’clock in the morning demand rises steadily and reaches its peak at 17 o’clock and 18 o’clock. After that, demand starts dropping again.
Based on the mean demand per hour, the busiest times of the day are 8 in the morning and 16-18 in the afternoon, with the peak at 17:00 and 18:00.
The distribution of demand by hour is similar in both locations. This might be because people start work at around 8-9 in the morning and finish at around 17-18. It could also explain why demand for bikes is higher on non-holidays, since citizens need to travel to work.
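The busiest hours can also be read directly from the hourly averages rather than from the bars. The following is a minimal sketch, assuming a data frame shaped like `grouped_seoul` or `grouped_wash` (one row per `Hour` with an `Average.Rented` column). The numbers in `toy_hourly` are invented for illustration only.

```r
# Hypothetical hourly averages; real values would come from grouped_seoul or grouped_wash.
toy_hourly <- data.frame(
  Hour = 0:23,
  Average.Rented = c(50, 30, 20, 12, 8, 15, 60, 180, 320, 200, 150, 170,
                     210, 205, 195, 215, 280, 420, 400, 290, 210, 160, 120, 80)
)

# Hour with the highest average demand.
peak_hour <- toy_hourly$Hour[which.max(toy_hourly$Average.Rented)]

# The three busiest hours, ordered from busiest down.
top_hours <- head(toy_hourly[order(-toy_hourly$Average.Rented), ], 3)

print(peak_hour)
print(top_hours)
```

Applying the same two lines to each city's grouped data would confirm the morning and evening peaks described above without relying on a visual reading of the chart.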

Is there an association between bike demand and the three meteorological variables (air temperature, wind speed and humidity)?


Seoul, South Korea

ggplot(seoul_Bikes) +
  geom_point(aes(x = Humidity, y = Temperature, size = Count, color = WindSpeed )) +
  stat_smooth(aes(x = Humidity, y = Temperature)) + 
  xlab("Humidity") + 
  ylab("Temperature") +
  ggtitle("Bikes rented based on the 3 meteorological variables, Seoul") +
  theme(plot.title = element_text(hjust = 0.5))

The scatter plot above uses all three meteorological variables to examine whether the demand for rented bikes depends on them. The size of each point shows the number of bikes rented and its colour shows the wind speed. From the colour of the points we can see that once the wind speed rises above 4 m/s the number of rented bikes decreases. Additionally, from the x and y axes we notice that fewer bikes are rented when temperature and humidity are very low or very high. We do not know whether this observation is due to the combination of the variables or to just one of them. Below we review a graph for each meteorological variable.

ggplot(seoul_Bikes) +
  geom_point(aes(x = Humidity, y = Count, size = Count) , col ="red") +
  stat_smooth(aes(x = Humidity, y = Count), method = "lm", col = "black") +
  stat_smooth(aes(x = Humidity, y = Count), col = "blue") +
  xlab("Humidity (%)") + 
  ylab("Number Rented Bikes") +
  ggtitle("Bikes rented based on Humidity") +
  theme(plot.title = element_text(hjust = 0.5))

The scatter plot above shows the relationship between the number of bikes rented and the level of humidity. Two fitted lines in the graph summarise this relationship.

  • Black line: a linear model, which tells us that as humidity increases so does the number of rented bikes.
  • Blue line: a smoothed trend curve (the default stat_smooth fit), which tells us that as humidity slowly increases so does the number of rented bikes, but once humidity goes over 75% there is a drop in the number of bikes rented.
The number of bikes rented is high when humidity is between 25% and 75%.
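The two fitted lines can disagree when the relationship is curved: a straight line can only rise or fall, while a smoother can rise and then fall. The following is a minimal sketch on simulated data, not on the Seoul dataset. Base R's `loess` stands in for the smoother here (as far as I know, `stat_smooth` picks loess or a GAM automatically depending on the number of observations), and the inverted-U shape of `toy` is an assumption made for illustration.

```r
set.seed(1)

# Simulated data with demand peaking at mid-range humidity (illustrative only).
toy <- data.frame(Humidity = seq(10, 95, length.out = 200))
toy$Count <- 800 - 0.3 * (toy$Humidity - 50)^2 + rnorm(200, sd = 40)

# Straight-line fit, as drawn by method = "lm".
linear_fit <- lm(Count ~ Humidity, data = toy)

# Local smoother, standing in for the default stat_smooth curve.
smooth_fit <- loess(Count ~ Humidity, data = toy)

# Compare predictions at low, middle and high humidity.
grid <- data.frame(Humidity = c(20, 50, 90))
linear_pred <- predict(linear_fit, newdata = grid)
smooth_pred <- predict(smooth_fit, newdata = grid)

print(round(linear_pred))
print(round(smooth_pred))
```

In this toy case the smoother recovers the peak in the middle of the range, while the linear fit only reports the overall direction, which is why the two lines are worth reading together.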

ggplot(seoul_Bikes) +
  geom_point(aes(x = WindSpeed, y = Count, size = Count) , col ="green") +
  stat_smooth(aes(x = WindSpeed, y = Count), method = "lm", col = "black") +
  stat_smooth(aes(x = WindSpeed, y = Count), col = "blue") +
  xlab("WindSpeed m/s") + 
  ylab("Number Rented Bikes") +
  ggtitle("Bikes rented based on WindSpeed") +
  theme(plot.title = element_text(hjust = 0.5))

The scatter plot above shows the relationship between the number of bikes rented and the wind speed. Two fitted lines in the graph summarise this relationship.

  • Black line: a linear model, which tells us that as wind speed increases the number of rented bikes decreases.
  • Blue line: a smoothed trend curve (the default stat_smooth fit), which tells us that as wind speed slowly increases the number of bikes rented increases, but once the wind speed exceeds 4 m/s there is a dramatic fall in the number of bikes rented.
When wind speed is over 4 m/s, few bikes are rented.

ggplot(seoul_Bikes) +
  geom_point(aes(x = Temperature, y = Count, size = Count) , col ="orange") +
  stat_smooth(aes(x = Temperature, y = Count), method = "lm", col = "black") +
  stat_smooth(aes(x = Temperature, y = Count), col = "blue") +
  xlab("Temperature degrees Celsius") + 
  ylab("Number Rented Bikes") +
  ggtitle("Bikes rented based on Temperature") +
  theme(plot.title = element_text(hjust = 0.5))

The scatter plot above shows the relationship between the number of bikes rented and the temperature. Two fitted lines in the graph summarise this relationship.

  • Black line: a linear model, which tells us that as the temperature becomes warmer the number of bikes rented gets higher.
  • Blue line: a smoothed trend curve (the default stat_smooth fit), which tells us that as the temperature gets warmer the number of bikes rented increases, but once the temperature goes over 30 degrees Celsius fewer bikes are rented.
For very cold or very hot temperatures, fewer bikes are rented.

From the graphs above we can conclude that each meteorological variable is associated with the number of bikes rented. The number of bikes rented differs greatly when the wind speed is very strong or when the temperature and humidity are very low or very high.
The number of bikes rented is at its highest when the temperature is between 0 and 20 degrees Celsius, humidity is between 25% and 75%, and wind speed is under 4 m/s.
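Ranges like these can be checked by cutting a variable into bands and averaging the demand in each band. The following is a minimal sketch for temperature only, using invented values in `toy_weather`. The band boundaries are assumptions chosen to mirror the ranges mentioned in the text.

```r
# Hypothetical observations; real values would come from seoul_Bikes.
toy_weather <- data.frame(
  Temperature = c(-8, -3, 4, 9, 15, 18, 24, 29, 33, 36),
  Count       = c(90, 150, 420, 510, 640, 700, 560, 430, 210, 160)
)

# Assign each observation to a temperature band (intervals are closed on the right).
toy_weather$TempBand <- cut(
  toy_weather$Temperature,
  breaks = c(-Inf, 0, 20, 30, Inf),
  labels = c("below 0", "0 to 20", "20 to 30", "above 30")
)

# Mean demand per band and the band with the highest mean.
mean_by_band <- tapply(toy_weather$Count, toy_weather$TempBand, mean)
best_band <- names(which.max(mean_by_band))

print(mean_by_band)
print(best_band)
```

The same pattern with `Humidity` or `WindSpeed` in place of `Temperature` would give a comparable table for the other two variables.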


Washington DC, America

ggplot(washington_Bikes) +
  geom_point(aes(x = Humidity, y = Temperature, size = Count, color = WindSpeed )) +
  stat_smooth(aes(x = Humidity, y = Temperature)) + 
  xlab("Humidity") + 
  ylab("Temperature") +
  ggtitle("Bikes rented based on the 3 meteorological variables, Washington DC") +
  theme(plot.title = element_text(hjust = 0.5))


ggplot(washington_Bikes) +
  geom_point(aes(x = Humidity, y = Count, size = Count) , col ="red") +
  stat_smooth(aes(x = Humidity, y = Count), method = "lm", col = "black") +
  stat_smooth(aes(x = Humidity, y = Count), col = "blue") +
  xlab("Humidity (%)") + 
  ylab("Number Rented Bikes") +
  ggtitle("Bikes rented based on Humidity") +
  theme(plot.title = element_text(hjust = 0.5))


ggplot(washington_Bikes) +
  geom_point(aes(x = WindSpeed, y = Count, size = Count) , col ="green") +
  stat_smooth(aes(x = WindSpeed, y = Count), method = "lm", col = "black") +
  stat_smooth(aes(x = WindSpeed, y = Count), col = "blue") +
  xlab("WindSpeed m/s") + 
  ylab("Number Rented Bikes") +
  ggtitle("Bikes rented based on WindSpeed") +
  theme(plot.title = element_text(hjust = 0.5))


ggplot(washington_Bikes) +
  geom_point(aes(x = Temperature, y = Count, size = Count) , col ="orange") +
  stat_smooth(aes(x = Temperature, y = Count), method = "lm", col = "black") +
  stat_smooth(aes(x = Temperature, y = Count), col = "blue") +
  xlab("Temperature degrees Celsius") + 
  ylab("Number Rented Bikes") +
  ggtitle("Bikes rented based on Temperature") +
  theme(plot.title = element_text(hjust = 0.5))


STATISTICAL ANALYSIS :

Linear Modelling Fitting

Seoul, South Korea

linear_model_log <- lm(log(Count) ~ Season + Humidity + Temperature + WindSpeed, data = seoul_Bikes)
summary(linear_model_log)
## 
## Call:
## lm(formula = log(Count) ~ Season + Humidity + Temperature + WindSpeed, 
##     data = seoul_Bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1073 -0.4281  0.0812  0.5493  2.4352 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.7336965  0.0467062 144.171  < 2e-16 ***
## SeasonSummer  0.0036038  0.0327843   0.110  0.91247    
## SeasonAutumn  0.3733211  0.0261578  14.272  < 2e-16 ***
## SeasonWinter -0.3830362  0.0349918 -10.946  < 2e-16 ***
## Humidity     -0.0224974  0.0004844 -46.441  < 2e-16 ***
## Temperature   0.0492700  0.0015053  32.732  < 2e-16 ***
## WindSpeed     0.0253809  0.0093544   2.713  0.00668 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8276 on 8458 degrees of freedom
## Multiple R-squared:  0.4941, Adjusted R-squared:  0.4937 
## F-statistic:  1377 on 6 and 8458 DF,  p-value: < 2.2e-16

Washington DC, America

linear_model_log <- lm(log(Count) ~ Season + Humidity + Temperature + WindSpeed, data = washington_Bikes)
summary(linear_model_log)
## 
## Call:
## lm(formula = log(Count) ~ Season + Humidity + Temperature + WindSpeed, 
##     data = washington_Bikes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4834 -0.6069  0.2458  0.8440  3.5203 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6264010  0.0576892  80.195  < 2e-16 ***
## SeasonSummer -0.3651680  0.0300276 -12.161  < 2e-16 ***
## SeasonAutumn  0.5361839  0.0289332  18.532  < 2e-16 ***
## SeasonWinter  0.1046103  0.0341346   3.065  0.00218 ** 
## Humidity     -0.0233425  0.0005317 -43.901  < 2e-16 ***
## Temperature   0.0797914  0.0017401  45.856  < 2e-16 ***
## WindSpeed     0.0237920  0.0043072   5.524 3.37e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.263 on 17372 degrees of freedom
## Multiple R-squared:  0.278,  Adjusted R-squared:  0.2777 
## F-statistic:  1115 on 6 and 17372 DF,  p-value: < 2.2e-16

Confidence Intervals on Linear Models

confint(linear_model_log, 'Humidity', level=0.97)
##                1.5 %      98.5 %
## Humidity -0.02449639 -0.02218851
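The interval above can be reproduced by hand from the Washington summary, which is the model `linear_model_log` refers to at this point because it was assigned last. The following is a minimal sketch using the usual formula, estimate plus or minus a t quantile times the standard error, with the estimate, standard error and residual degrees of freedom copied from the printed output. The variable names are my own.

```r
# Values copied from the Washington model summary (Humidity row).
estimate  <- -0.0233425
std_error <- 0.0005317
df_resid  <- 17372

# A 97% interval leaves 1.5% in each tail.
level  <- 0.97
t_crit <- qt(1 - (1 - level) / 2, df = df_resid)

# Lower and upper limits of the confidence interval.
interval <- estimate + c(-1, 1) * t_crit * std_error

print(t_crit)
print(interval)
```

The result should agree with the `confint` output up to the rounding of the printed estimate and standard error. Since the whole interval lies below zero, the negative association between humidity and bike rentals in this model is statistically distinguishable from no effect at the 97% level.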